Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is 
one of several approaches for tolerating memory latencies. Prefetching can be either 
hardware-based or software-directed or a combination of both. Hardware-based 
prefetching, requiring some support unit connected to the cache, can dynamically handle 
prefetches at run-time without compiler intervention. Software-directed approaches rely on 
compiler technology to insert explicit prefetch instructions. Mowry ... 
Despite large caches, main-memory access latencies still cause significant performance 
losses in many applications. Numerous hardware and software prefetching schemes have 
been proposed to tolerate these latencies. Software prefetching typically provides better 
prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the 
compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two 
cache miss latencies. Hardware prefetching can be e ... 
Beyond data caching, data prefetching is by far the most effective way to address the 
memory access bottleneck associated with high-performance processors. This is particularly 
true for scientific programs whose working sets cannot be easily fit into the on-chip data 
cache. This paper proposes a new data prefetching architecture called Sunder, which 
combines the flexibility and accurateness of software prefetching and the transparency and 
low-overhead of hardware prefetching. Th ... 
We present a new hardware-based data prefetching mechanism for enhancing instruction 
level parallelism and improving the performance of superscalar processors. The emphasis in 
our scheme is on the effective utilization of slack time and hardware resources not used for 
the main computation. The scheme suggests a new hardware construct, the program 
progress graph (PPG), as a simple extension to the branch target buffer (BTB). We use the 
PPG for implementing a fast pre-program counter pre-PC, that ... 

Keywords: LRU mechanism, SPEC92 benchmark, Tango, base line architecture, branch 
target buffer, hardware resources, hardware-based data prefetching technique, instruction 
level parallelism, instruction prefetching, memory reference instructions, parallel 
Memory system bottlenecks limit performance for many applications, and computations 
with strided access patterns are among the hardest hit. The streams used in such 
applications have extremely poor cache behavior. These access patterns have the 
advantage of being predictable, though, and this can be exploited to improve the efficiency 
of the memory subsystem in two ways: memory latencies can be masked by prefetching 
stream data, and the latencies can be reduced by reordering stream accesses ... 
The expanding gap between microprocessor and DRAM performance has necessitated the 
use of increasingly aggressive techniques designed to reduce or hide the latency of main 
memory access. Although large cache hierarchies have proven to be effective in reducing 
this latency for the most frequently used data, it is still not uncommon for many programs 
to spend more than half their run times stalled on memory requests. Data prefetching has 
been proposed as a technique for hiding the access lat ... 
Current techniques for prefetching linked data structures (LDS) exploit the work available in 
one loop iteration or recursive call to overlap pointer chasing latency. Jump pointers, which 
provide direct access to non-adjacent nodes, can be used for prefetching when loop and 
recursive procedure bodies are small and do not have sufficient work to overlap a long 
latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports 
four prefetching idioms: queue, full, chain, an ... 
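
The general idea of a jump pointer can be sketched as follows. This is a minimal illustration only, not the framework described in the abstract: the `Node.jump` field, the `distance` parameter, and the `prefetch` callback are all hypothetical names, and the callback merely stands in for a real prefetch instruction. Jump pointers are installed here with a small queue of recently visited nodes, loosely in the spirit of the "queue" idiom named above.

```python
from collections import deque


class Node:
    """Linked-list node with an extra (hypothetical) jump-pointer field."""

    def __init__(self, value):
        self.value = value
        self.next = None
        self.jump = None  # points `distance` nodes ahead once installed


def build_list(values):
    """Build a singly linked list and return its head."""
    head = tail = None
    for v in values:
        node = Node(v)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head


def install_jump_pointers(head, distance):
    """Give each node a pointer to the node `distance` hops further on.

    A queue holds the last `distance` nodes seen; when a new node arrives,
    the oldest queued node receives a jump pointer to it.
    """
    recent = deque()
    node = head
    while node is not None:
        if len(recent) == distance:
            recent.popleft().jump = node
        recent.append(node)
        node = node.next


def traverse(head, prefetch):
    """Sum the list, requesting each node's jump target before using the node."""
    total = 0
    node = head
    while node is not None:
        if node.jump is not None:
            prefetch(node.jump)  # stand-in for a non-binding prefetch
        total += node.value
        node = node.next
    return total
```

With `distance` set to 2, visiting node 0 requests node 2, visiting node 1 requests node 3, and so on, so the request for a node is issued two iterations before the traversal reaches it.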

10 A general framework for prefetch scheduling in linked data structures and its 
application to multi-chain prefetching 

Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, Donald Yeung 
May 2004 ACM Transactions on Computer Systems (TOCS), Volume 22 issue 2 

Pointer-chasing applications tend to traverse composite data structures consisting of 
multiple independent pointer chains. While the traversal of any single pointer chain leads to 
the serialization of memory operations, the traversal of independent pointer chains provides 
a source of memory parallelism. This article investigates exploiting such interchain memory 
parallelism for the purpose of memory latency tolerance, using a technique called multi- 
chain prefetching. Previous work ... 
approach 

February 2001 ACM Transactions on Computer Systems (TOCS), Volume 19 issue 1 

Instruction cache miss latency is becoming an increasingly important performance 
bottleneck, especially for commercial applications. Although instruction prefetching is an 
attractive technique for tolerating this latency, we find that existing prefetching schemes 
are insufficient for modern superscalar processors, since they fail to issue prefetches early 
enough (particularly for nonsequential accesses). To overcome these limitations, we 
propose a new instruction prefetching technique where ... 

Keywords: compiler optimization, instruction prefetching 
This paper explores Speculative Precomputation, a technique that uses idle thread context 
in a multithreaded architecture to improve performance of single-threaded applications. It 
attacks program stalls from data cache misses by pre-computing future memory accesses 
in available thread contexts, and prefetching these data. This technique is evaluated by 
simulating the performance of a research processor based on the Itanium™ ISA supporting 
Simultaneous Multithreading. Two primary for ... 

14 Call graph prefetching for database applications 
Murali Annavaram, Jignesh M. Patel, Edward S. Davidson 

November 2003 ACM Transactions on Computer Systems (TOCS), volume 21 issue 4 

With the continuing technological trend of ever cheaper and larger memory, most data sets 
in database servers will soon be able to reside in main memory. In this configuration, the 
performance bottleneck is likely to be the gap between the processing speed of the CPU and 
the memory access latency. Previous work has shown that database applications have large 
instruction and data footprints and hence do not use processor caches effectively. In this 
paper, we propose Call Graph Prefetching (CGP), ... 
annual international symposium on Computer architecture, volume 23 issue 2 
While many parallel applications exhibit good spatial locality, other important codes in areas 
like graph problem-solving or CAD do not. Often, these irregular codes contain small 
records accessed via pointers. Consequently, while the former applications benefit from 
long cache lines, the latter prefer short lines. One good solution is to combine short lines 
with prefetching. In this way, each application can exploit the amount of spatial locality that 
it has. However, prefetching, if provided, ... 

16 DSTRIDE: data-cache miss-address-based stride prefetching scheme for multimedia 
processors 

Hariprakash. G, Achutharaman. R, Amos R. Omondi 

January 2001 Australian Computer Science Communications , Proceedings of the 6th 
4 

Prefetching reduces cache miss latency by moving data up in memory hierarchy before they 
are actually needed. Recent hardware-based stride prefetching techniques mostly rely on 
the processor pipeline information (e.g. program counter and branch prediction table) for 
prediction. Continuing developments in processor microarchitecture drastically change core 
pipeline design and require that existing hardware-based stride prefetching techniques be 
adapted to the evolving new processor architectures. ... 
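
A stride predictor driven only by miss addresses, with no pipeline information such as the program counter, can be sketched as follows. This is a generic single-entry illustration under stated assumptions, not the DSTRIDE design: the class name, the `threshold` parameter, and the confidence rule are hypothetical.

```python
class MissStridePredictor:
    """Toy stride detector that observes only cache-miss addresses."""

    def __init__(self, threshold=2):
        self.last_addr = None       # most recent miss address
        self.stride = 0             # last observed address delta
        self.confidence = 0         # consecutive confirmations of the stride
        self.threshold = threshold  # confirmations needed before predicting

    def on_miss(self, addr):
        """Record a miss; return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if delta != 0 and delta == self.stride:
                self.confidence += 1
            else:
                # New or broken pattern: remember the delta and start over.
                self.stride = delta
                self.confidence = 0
            if self.confidence >= self.threshold:
                prediction = addr + self.stride
        self.last_addr = addr
        return prediction
```

Feeding it misses at a constant 64-byte stride yields no prediction until the stride has been confirmed `threshold` times; an irregular miss resets the confidence counter.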
A hardware prefetching mechanism named Speculative Prefetching is proposed. This 
scheme detects vector accesses issued by a load/store instruction and prefetches the 
corresponding data. The scheme requires no software add-on, and in some cases it is more 
powerful than software techniques for identifying regular accesses. The tradeoffs related to 
its hardware implementation are extensively discussed in order to finely tune the 
mechanism. Experiments show that average memory ... 

18 The interaction of software prefetching with ILP processors in shared-memory systems 
Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, Sarita V. Adve 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th annual international symposium on Computer architecture, Volume 25 issue 2 

Current microprocessors aggressively exploit instruction-level parallelism (ILP) through 
techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent 
work has shown that memory latency remains a significant performance bottleneck for 
shared-memory multiprocessor systems built of such processors. This paper provides the 
first study of the effectiveness of software-controlled non-binding prefetching in shared 
memory multiprocessors built of state-of-the-art ILP-based proces ... 
Mark D. Hill, James R. Larus, Steven K. Reinhardt, David A. Wood 
We believe the absence of massively-parallel, shared-memory machines follows from the 
lack of a shared-memory programming performance model that can inform programmers of 
the cost of operations (so they can avoid expensive ones) and can tell hardware designers 
which cases are common (so they can build simple hardware to optimize them). 
Cooperative shared memory, our approach to shared-memory design, addresses this 
problem. Our initial implementation of cooperativ ... 
We believe the paucity of massively parallel, shared-memory machines follows from the 
lack of a shared-memory programming performance model that can inform programmers of 
the cost of operations (so they can avoid expensive ones) and can tell hardware designers 
which cases are common (so they can build simple hardware to optimize them). 
Cooperative shared memory, our approach to shared-memory design, addresses this 
problem. Our initial implementation of cooperative shared memory u ... 
As microprocessors become faster, the relative performance cost of memory accesses 
increases. Bigger and faster caches significantly reduce the absolute load-to-use time delay. 
However, increase in processor operational frequencies impairs the relative load-to-use 
latency, measured in processor cycles (e.g. from two cycles on the Pentium® processor 
to three cycles or more in current designs). Load-address prediction techniques were 
introduced to partially cut the load-to-use latency. Thi ... 
Improvements in main memory speeds have not kept pace with increasing processor clock 
frequency and improved exploitation of instruction-level parallelism. Consequently, the gap 
between processor and main memory performance is expected to grow, increasing the 
number of execution cycles spent waiting for memory accesses to complete. One solution to 
this growing problem is to reduce the number of cache misses by increasing the 
effectiveness of the cache hierarchy. In this paper we present a techni ... 
Software-controlled data prefetching offers the potential for bridging the ever-increasing 
speed gap between the memory subsystem and today's high-performance processors. 
While prefetching has enjoyed considerable success in array-based numeric codes, its 
potential in pointer-based applications has remained largely unexplored. This paper 
investigates compiler-based prefetching for pointer-based applications— in particular, those 
containing recursive data structures. We identify the fundamental ... 
Today's commodity microprocessors require a low latency memory system to achieve high 
sustained performance. The conventional high-performance memory system provides fast 
data access via a large secondary cache. But large secondary caches can be expensive, 
particularly in large-scale parallel systems with many processors (and thus many 
caches). We evaluate a memory system design that can be both cost-effective as well as 
provide better performance, particularly for scientific workloads: a single ... 
We introduce a dynamic scheme that captures the access patterns of linked data structures 
and can be used to predict future accesses with high accuracy. Our technique exploits the 
dependence relationships that exist between loads that produce addresses and loads that 
consume these addresses. By identifying producer-consumer pairs, we construct a 
compact internal representation for the associated structure and its traversal. To achieve a 
prefetching effect, a small prefetch engine speculatively t ... 
The large latency of memory accesses in large-scale shared-memory multiprocessors is a 
key obstacle to achieving high processor utilization. Software-controlled prefetching is a 
technique for tolerating memory latency by explicitly executing instructions to move data 
close to the processor before the data are actually needed. To minimize the burden on the 
programmer, compiler support is needed to automatically insert prefetch instructions into 
the code. A key challenge when ... 